GENERALIZED LINEAR MODELS

Biostatistics Support and Research Unit

Germans Trias i Pujol Research Institute and Hospital (IGTP)
Badalona, Spain

April 2, 2025

Summary

1. Statistical models

2. Simple linear regression model

3. Multiple linear regression model

4. Logistic regression model

5. Take home messages

Statistical models

Statistical models

  • A statistical model establishes a relationship between a dependent variable and one or more independent variables by means of an equation:

    \[Y=f(X_1, \dots, X_k) + e\]

    • \(Y\): response, dependent, predicted variable

    • \(X_1, \dots, X_k\): predictors, independent, explanatory variables

    • \(e\): error term. Random variable with a given distribution. Part of \(Y\) that \(X_1, \dots, X_k\) are unable to explain.

Goals

  • Explanatory: Which variables \(X_i\) are important to explain \(Y\)? What is the effect of \(X_i\) on \(Y\)?

    • Example: What is the effect of sugary drinks (\(X\)) on diabetes risk (\(Y\))?


  • Predictive: Predict \(Y\) given \(X_i\). Which value of \(Y\) can we expect for an individual with certain characteristics \(X_1, \dots, X_k\)?

    • Example: Estimate a clinical prediction model to predict the risk of diabetes (\(Y\)) given a subject’s demographic and clinical characteristics (\(X_i\))

Types of statistical models


  • Depending on the nature of the response variable \(Y\) there are different types of models:

    • \(Y\) continuous \(\longrightarrow\) Linear regression models

    • \(Y\) categorical with 2 categories \(\longrightarrow\) Logistic regression models

    • \(Y\) categorical with more than 2 categories \(\longrightarrow\) Multinomial regression models, ordinal regression models

    • \(Y\) counts \(\longrightarrow\) Poisson regression models

    • \(Y\) time to an event \(\longrightarrow\) Cox regression models (next session)


Simple linear regression model

Introduction

  • A linear regression model establishes a relationship between:

    • A quantitative response variable \(Y\)

    • Quantitative or qualitative predictor variables \(X_1, \dots, X_k\)

  • Formula specification

    \[Y=\beta_{0}+\beta_{1}X_1+\beta_{2}X_2+\ldots+\beta_{k}X_k+e\]

    • Assuming that residuals \(e\)
      • Follow a normal distribution
      • Have mean \(0\) and constant variance \(\sigma^2\)

Simple linear regression

  • Goal: To quantify the relationship between a dependent variable \(Y\) and an independent variable \(X\)

  • \(Y\): Dependent/response/outcome variable

  • \(X\): Independent/predictor/explanatory variable

  • Linear relationship:

\[Y= \beta_0 + \beta_1 X + e\]

  • \(\beta_0\) is the intercept (constant) term.
  • \(\beta_1\) is the slope, the change in \(Y\) when \(X\) increases by one unit.
  • \(e\) is the error term \(\longrightarrow\) part of the \(Y\) that is not explained by the predictor variable in the regression model.

Coefficient interpretation

  • \(\beta_0\) \(\longrightarrow\) expected value of \(Y\) when \(X=0\).

  • \(\beta_1\) > 0 \(\longrightarrow\) When \(X\) increases, \(Y\) increases.

  • \(\beta_1\) < 0 \(\longrightarrow\) When \(X\) increases, \(Y\) decreases.

  • We can do a hypothesis test on \(\beta_1\) to test whether there is statistically significant evidence that X is associated with Y:

    • \(H_0: \beta_1=0\)
    • \(H_1: \beta_1\neq0\)
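This test compares the estimated slope with its standard error; under \(H_0\),

```latex
t = \frac{\hat{\beta}_1}{\widehat{\mbox{SE}}(\hat{\beta}_1)} \sim t_{n-2}
```

where \(n\) is the number of observations.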

Example dataset: REGICOR study

First, install and load the {compareGroups} package into your R session

install.packages("compareGroups")
library(compareGroups)


We will work with a sample dataset from this package:

data(regicor) #?regicor

This dataset contains data from 2294 individuals from the REGICOR study (Registre Gironí del COR).

Example dataset: REGICOR study

library(gt)
gt(head(regicor, 3))
id year age sex smoker sbp dbp histhtn txhtn chol hdl triglyc ldl histchol txchol height weight bmi phyact pcs mcs cv tocv death todeath
2265 2005 70 Female Never smoker 138 75 No No 294 57.00000 93 218.4000 No No 160 64 25.00000 304.2000 54.455 58.918 No 1024.882 Yes 1299.16343
1882 2005 56 Female Never smoker 139 89 No No 220 50.00000 160 138.0000 No No 163 67 25.21736 160.3000 58.165 47.995 No 2756.849 No 39.32629
3000105616 2000 37 Male Current or former < 1y 132 82 No No 245 59.80429 89 167.3957 No No 170 70 24.22145 552.7912 43.429 62.585 No 1905.969 No 858.42203

Simple linear regression

We want to study which variables are associated with systolic blood pressure (SBP).

  • \(Y\): systolic blood pressure (sbp)

  • Let’s start with age as a predictor variable.

  • We want to find \(\beta_0\) and \(\beta_1\) so that \[\mbox{sbp} = \beta_0 + \beta_1\mbox{age} + e\]

  • But first, let’s look at the relationship between these two variables.

Simple linear regression

  • First, we can plot the values of these two variables to see if they have a relationship and if it is linear.
library(ggplot2)
ggplot(regicor, aes(x=age, y=sbp))+
  geom_point() +
  labs(x = "Age", 
       y = "Systolic blood pressure")+
  theme_bw()

Correlation

  • We can compute Pearson’s correlation coefficient to quantify the strength and direction of their linear association.

  • The cor.test() function computes the correlation coefficient, its 95% CI and the p-value.

cor.test(regicor$sbp, regicor$age)

    Pearson's product-moment correlation

data:  regicor$sbp and regicor$age
t = 26.772, df = 2278, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4573471 0.5198247
sample estimates:
      cor 
0.4892133 

Correlation

  • \(\rho\)=0.489, 95%CI [0.457; 0.520] \(\longrightarrow\) Moderate positive correlation

  • p-value tests the hypothesis

    • \(H_0: \rho=0\)
    • \(H_1: \rho\neq0\)

p-value in a correlation test

  • The p-value only tests whether the correlation differs from 0.
  • In large samples, even small correlations can produce misleadingly significant p-values.
  • Whenever possible, report \(\rho\) together with its 95% CI.

Correlation

  • \(\rho\) takes values between -1 and 1
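Pearson’s \(\rho\) is the covariance of the two variables rescaled by their standard deviations, which is why it is unit-free and bounded by \(\pm 1\). A minimal check with R’s built-in cars dataset (used here only for illustration):

```r
# Pearson's correlation = covariance standardised by both SDs
x <- cars$speed
y <- cars$dist
r_manual <- cov(x, y) / (sd(x) * sd(y))

stopifnot(isTRUE(all.equal(r_manual, cor(x, y))))  # same value as cor()
stopifnot(r_manual >= -1, r_manual <= 1)           # always in [-1, 1]
```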

Correlation

  • Pearson’s correlation coefficient assumes that X and Y follow a normal distribution

  • Alternative: Spearman’s correlation coefficient (argument method in cor.test())

  • With a correlation matrix we can also test several variables at the same time

Correlation is not causation

  • When the correlation between two variables is very high (correlation coefficients close to 1 or -1), it is tempting to assume that this means that one variable causes the other

  • Correlation is about the relationship between two variables, not about which variable causes the other.

  • Check the website: https://www.tylervigen.com/spurious-correlations

Correlation: interpretation errors

Anscombe’s quartet

  • All these pairs of variables have the same correlation (0.816), but be careful with the interpretation!

Simple linear regression

  • Following with our example, we want to estimate the regression line
library(ggplot2)
ggplot(regicor, aes(x=age, y=sbp)) +
  geom_point() +
  geom_smooth(method='lm', formula= y~x) +
  labs(x = "Age", y = "Systolic blood pressure")+
  theme_bw()

Simple linear regression

  • We can compute the coefficients of the regression model with lm() function, specifying the dependent and the independent variable.
m1 <- lm(sbp ~ age, data=regicor)
summary(m1)

Call:
lm(formula = sbp ~ age, data = regicor)

Residuals:
    Min      1Q  Median      3Q     Max 
-48.642 -12.825  -1.721  11.363  93.962 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 81.87772    1.87836   43.59   <2e-16 ***
age          0.90103    0.03366   26.77   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.71 on 2278 degrees of freedom
  (14 observations deleted due to missingness)
Multiple R-squared:  0.2393,    Adjusted R-squared:  0.239 
F-statistic: 716.7 on 1 and 2278 DF,  p-value: < 2.2e-16

Simple linear regression

library(gtsummary)
m1 |> 
  tbl_regression(intercept=TRUE)
Characteristic Beta 95% CI p-value
(Intercept) 82 78, 86 <0.001
age 0.90 0.84, 0.97 <0.001
Abbreviation: CI = Confidence Interval

Simple linear regression

m1 |> 
  tbl_regression(intercept=TRUE)
Characteristic Beta 95% CI p-value
(Intercept) 82 78, 86 <0.001
age 0.90 0.84, 0.97 <0.001
Abbreviation: CI = Confidence Interval
  • \(\mbox{sbp} = 82 + 0.9\cdot\mbox{age}\)

  • \(\beta_0=82\): the expected value of systolic blood pressure when age = 0 (not meaningful here, since age = 0 lies far outside the observed range)

  • p-value for age \(<0.001 \longrightarrow\) Age has a statistically significant effect on SBP

  • \(\beta_1 > 0\): if age increases, SBP increases.

  • \(\beta_1 = 0.9\): for each additional year of age, the expected SBP is 0.9 mmHg higher.

  • For a subject 5 years older, the expected increase in SBP is \(0.9 \cdot 5 = 4.5\) mmHg.

Goodness of fit

  • The goodness of fit of a regression model is assessed by the \(\mbox{R}^2\)

  • It is the percentage of the variability of \(Y\) explained by the model

  • It is a non-dimensional value that ranges from 0 to 1

    • 0 indicates that the model explains none of the variability of the response
    • 1 indicates that the model explains all the variability of the response
  • We can add the \(\mbox{R}^2\) to our results table using add_glance_table().

m1 |> 
  tbl_regression(intercept=TRUE) |> 
  add_glance_table(include = r.squared)
Characteristic Beta 95% CI p-value
(Intercept) 82 78, 86 <0.001
age 0.90 0.84, 0.97 <0.001
R² 0.239

Abbreviation: CI = Confidence Interval

Model diagnostics

  • After building the model, we have to check if it is appropriate for our data, by means of:

    • Normality of residuals
    • Relation between the residuals and the predicted values
    • Homoscedasticity: constant error variance across observations
    • Assessment of outliers
par(mfrow=c(2,2)) # Change the panel layout to 2 x 2
plot(m1)

Normality of residuals

plot(m1,2)
  • Dots should lie approximately on the line

  • Some deviation is allowed at the extremes

Residuals vs predicted values

plot(m1,1)

A correct pattern of residuals (\(e= Y - \hat{Y}\)):

  • Residuals randomly scattered around 0

  • The red line stays close to the 0 value

  • There are no “extreme” residuals

Homoscedasticity

plot(m1,3)

The assumption of homoscedasticity is satisfied if:

  • Red line approximately horizontal across the plot

  • There is no clear pattern among the residuals

Assessing outliers

plot(m1,4)

We consider a point to be influential when the Cook’s distance is larger than 1. But always take into account the context of your data.
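The same check can be done numerically: cooks.distance() returns the Cook’s distance of every observation, so points above the cutoff can be listed directly. A sketch with the built-in cars dataset (the threshold 1 is the rule of thumb mentioned above):

```r
# Cook's distances for a simple linear model
m <- lm(dist ~ speed, data = cars)
cd <- cooks.distance(m)

# Observations with Cook's distance > 1 (rule of thumb)
which(cd > 1)
```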

Multiple linear regression model

Multiple linear regression

  • Multiple linear regression is the generalization of a simple linear regression model by incorporating more than one predictor variable

  • Useful to investigate the effect of a set of predictor variables on a response variable

  • Useful to study the effect of one predictor variable adjusted by a set of covariates.

Multiple linear regression

  • Given

    • A quantitative response variable \(Y\)
    • Quantitative or qualitative predictor variables \(X_1, \dots, X_k\)
  • Formula specification: \[Y=\beta_{0}+\beta_{1}X_1+\beta_{2}X_2+\ldots+\beta_{k}X_k+e\]

    • Assuming that residuals \(e\)
      • Follow a normal distribution
      • Have mean \(0\) and constant variance \(\sigma^2\)

Multiple linear regression

  • Coefficients \(\beta_i\) give us information about the effect of \(X_i\) on \(Y\)

  • If \(X_i\) is quantitative, \(\beta_i\) represents the change in the Y value when the variable \(X_i\) increases in one unit (when all other predictors are held constant):

    • If \(\beta_i > 0 \longrightarrow\) When \(X_i\) increases, \(Y\) increases.
    • If \(\beta_i < 0 \longrightarrow\) When \(X_i\) increases, \(Y\) decreases.

Multiple linear regression

  • If \(X_i\) is qualitative, \(\beta_i\) represents the change in the Y value with respect to the reference category of \(X_i\) (when all other predictors are held constant).

    • If \(\beta_i > 0 \longrightarrow\) If we change from the reference category to \(X_i\)’s category, \(Y\) increases.
    • If \(\beta_i < 0 \longrightarrow\) If we change from the reference category to \(X_i\)’s category, \(Y\) decreases.
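Internally, R turns a qualitative predictor into 0/1 indicator (dummy) variables, one for each non-reference category; model.matrix() shows this expansion. A small sketch with a hypothetical three-level factor:

```r
# A 3-level factor expands into 2 indicator columns;
# the first level ("a") acts as the reference category
f <- factor(c("a", "b", "c", "a"))
mm <- model.matrix(~ f)
colnames(mm)  # "(Intercept)" "fb" "fc"
```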

Multiple linear regression

Let’s continue with the regicor dataset to analyze the effect of age (quantitative) and sex (qualitative) on systolic blood pressure

m2 <- lm(sbp ~ age+sex, data=regicor)
m2 |> 
  tbl_regression(intercept=TRUE)
Characteristic Beta 95% CI p-value
(Intercept) 85 81, 88 <0.001
age 0.90 0.83, 0.96 <0.001
Sex


    Male
    Female -5.4 -6.9, -4.0 <0.001
Abbreviation: CI = Confidence Interval

\[\mbox{SBP}=85+0.9\cdot\{\mbox{age}\}-5.4\cdot \{\mbox{sex=Female}\}\]

Multiple linear regression

m2 |> 
  tbl_regression(intercept=TRUE)
Characteristic Beta 95% CI p-value
(Intercept) 85 81, 88 <0.001
age 0.90 0.83, 0.96 <0.001
Sex


    Male
    Female -5.4 -6.9, -4.0 <0.001
Abbreviation: CI = Confidence Interval
  • Both p-values are lower than 0.05 \(\longrightarrow\) Both age and sex are significantly associated with SBP

  • \(\beta_{age} = 0.90 > 0 \longrightarrow\) When age increases, SBP increases \(\longrightarrow\) For a one-year increase in age, SBP increases by 0.90 mmHg, holding sex constant.

  • \(\beta_{female} = -5.4 < 0 \longrightarrow\) Females have lower SBP than males \(\longrightarrow\) At any given age, females are expected to have an SBP 5.4 mmHg lower than males.

Goodness of fit

  • Adding significant variables increases the predictive ability of the model.

  • However, \(R^2\) always increases when we add more variables, even if they are uninformative.

  • Adjusted \(R^2\): modified version of \(R^2\) that is adjusted for the number of predictors.

m2 |> 
  tbl_regression(intercept=TRUE)|> 
  add_glance_table(include = c(r.squared, adj.r.squared))
Characteristic Beta 95% CI p-value
(Intercept) 85 81, 88 <0.001
age 0.90 0.83, 0.96 <0.001
Sex


    Male
    Female -5.4 -6.9, -4.0 <0.001
R² 0.257

Adjusted R² 0.256

Abbreviation: CI = Confidence Interval

Predictions

  • We can predict the value of SBP given the characteristics of an individual with predict() function.

  • If no newdata argument is specified, predictions are computed for the data used to fit the model.

library(dplyr)
regicor2 <- regicor |> 
  filter(!is.na(sbp) & !is.na(age) & !is.na(sex))  #we can't predict with missing data
regicor2 <- regicor2 |>
  mutate(pred = predict(m2, newdata=regicor2)) |> #store predictions
  select(id,age,sex,pred) #select variables to visualize predictions 
head(regicor2,4)
          id age    sex     pred
1       2265  70 Female 142.3198
2       1882  56 Female 129.7232
3 3000105616  37   Male 118.0482
4 3000103485  69 Female 141.4200

Predictions

  • We can also make predictions for a certain profile of patients specifying the argument newdata.
#prediction
newdata <- data.frame(age=70, sex="Male") #define new data
newdata |>
  mutate(pred = predict(m2,  type="response", newdata=newdata))
  age  sex     pred
1  70 Male 147.7402
newdata2 <- data.frame(age=70, sex=c("Male", "Female")) #define new data
newdata2 |>
  mutate(pred = predict(m2,  type="response", newdata=newdata2))
  age    sex     pred
1  70   Male 147.7402
2  70 Female 142.3198

Comparison of nested linear models

  • The anova() function allows us to compare two nested models to see whether they differ
anova(m1, m2)
Analysis of Variance Table

Model 1: sbp ~ age
Model 2: sbp ~ age + sex
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1   2278 714849                                  
2   2277 698125  1     16724 54.547 2.119e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • The p-value tests the null hypothesis that both models explain the data equally well.

  • p-value < 0.05 \(\longrightarrow\) Models m1 and m2 differ \(\longrightarrow\) Sex explains part of the variability in SBP beyond what age explains.

Comparison of linear models

  • Models (nested or not) can also be compared with the AIC.

  • Deviance is a measure of the lack of fit to the data in a regression model.

  • Akaike information criterion (AIC) \[\mbox{AIC} = \mbox{Deviance} + 2k,\] where k is the number of parameters in the model.

    • AIC penalizes for model complexity.

    • Smaller values indicate better fit.

    • The AIC for a model can be computed with AIC() function.
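In R, AIC() uses the general definition \(\mbox{AIC} = -2\log L + 2k\); the deviance in the formula above equals \(-2\log L\) up to a constant that does not depend on the fitted model, so model comparisons are unaffected. A quick check with the built-in cars dataset:

```r
m <- lm(dist ~ speed, data = cars)
k <- attr(logLik(m), "df")  # parameters: coefficients + error variance
stopifnot(isTRUE(all.equal(AIC(m), -2 * as.numeric(logLik(m)) + 2 * k)))
```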

Comparison of linear models

AIC(m1)
[1] 19581.56
AIC(m2)
[1] 19529.59
  • AIC for model m2 is lower

  • m2 (age and sex) fits the data better than m1 (age).

Other features of tbl_regression()

  • tbl_regression() allows us to add some extra information (number of observations, AIC, etc)
m2 |> 
  tbl_regression(intercept=TRUE) |>
  add_glance_table(
    include = c(nobs, r.squared, adj.r.squared, AIC))
Characteristic Beta 95% CI p-value
(Intercept) 85 81, 88 <0.001
age 0.90 0.83, 0.96 <0.001
Sex


    Male
    Female -5.4 -6.9, -4.0 <0.001
No. Obs. 2,280

R² 0.257

Adjusted R² 0.256

AIC 19,530

Abbreviation: CI = Confidence Interval

Other features of tbl_regression()

  • We can visualize two models at the same time with tbl_merge() function
t1 <- m1 |> 
  tbl_regression(intercept=TRUE)
t2 <- m2 |> 
  tbl_regression(intercept=TRUE)

tbl_merge(
  list(t1, t2),
  tab_spanner = c("**Model with age**", "**Model with age and sex**"))
Characteristic      Model with age             Model with age and sex
                    Beta  95% CI      p-value  Beta  95% CI      p-value
(Intercept)         82    78, 86      <0.001   85    81, 88      <0.001
age                 0.90  0.84, 0.97  <0.001   0.90  0.83, 0.96  <0.001
Sex
    Male
    Female                                     -5.4  -6.9, -4.0  <0.001
Abbreviation: CI = Confidence Interval

Logistic regression models

Introduction

  • In many studies, the outcome variable of interest is the presence or absence of some event:

    • Treatment response (yes, no)
    • Test for some infection (positive, negative)
    • Death at 5 years (yes, no)
  • Outcome: binary/dichotomous variable \(\longrightarrow\) Logistic regression

  • Goal: describe the relationship between the binary outcome and several explanatory variables.

Examples

  • Evaluating whether a patient responds to a treatment for hypertension.

  • Identifying patients at risk of ischemic stroke based on lifestyle and medical data.

  • Predicting whether a patient has diabetes based on glucose levels, BMI, and insulin levels.

Introduction

  • Let \(Y\) be a binary variable such that
    • \(Y=1\) if the outcome is present
    • \(Y=0\) if the outcome is not present
  • A linear regression model

\[Y=\beta_0+\beta_1 X_1+\cdots+\beta_k X_k+e\]

is not what we need: its predictions are not constrained to lie between 0 and 1.

Example

  • We want to study the relationship between exitus and a quantitative biomarker.
  • Outcome of interest: Exitus (1 if the subject dies, 0 otherwise).

Example

  • We can add the linear regression line, but…

Example

  • It would be nicer to have…

Modelization

  • In the logistic regression model we want to model the probability of the outcome, \(P(Y=1)=p\), a value between 0 and 1.

  • We can compute the odds, the probability of the event occurring divided by the probability of it not occurring:

\[\frac{p}{1-p}=\frac{P(Y=1)}{1-P(Y=1)}\]

  • And we then compute the log-odds:

\[\ln\left(\frac{p}{1-p}\right)=\ln\left(\frac{P(Y=1)}{1-P(Y=1)}\right)\]

Modelization

  • What we model in logistic regression is the log-odds of the probability of the event:

\[\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k\]

\(p=P(Y=1)\)

\(X_1, \ldots, X_k\): Variables of interest.

\(\beta_1, \ldots, \beta_k\): Regression coefficients of each variable.
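Solving the model equation for \(p\) shows how a probability is recovered from the linear predictor, and guarantees that \(0 < p < 1\):

```latex
p = \frac{e^{\eta}}{1+e^{\eta}} = \frac{1}{1+e^{-\eta}},
\qquad \eta = \beta_0 + \beta_1 X_1 + \cdots + \beta_k X_k
```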

Modelization

  • \(\beta_1, \ldots, \beta_k\): indicate the magnitude of the effect of the corresponding variable.

  • For \(i=1,\ldots,k\), the \(p\)-values associated with these estimates correspond to

    • \(H_0: \beta_i=0\)
    • \(H_1: \beta_i\neq0\)

and they indicate whether the \(i\)-th variable has a statistically significant effect on the odds of Y=1.

Odds ratios

  • The odds ratio (OR) for the \(i\)-th variable is \(\exp(\beta_i) = e^{\beta_i}\)

    • Quantitative variables: Change in the odds of the event occurring for each unit increase in the variable.

    • Qualitative variables: Change in the odds of the event occurring for each category of the variable with respect to the reference category.

OR interpretation

  • \(\mbox{Odds ratio}>1 \rightarrow\) The odds of the event occurring increases.

  • \(\mbox{Odds ratio}<1 \rightarrow\) The odds of the event occurring decreases.

  • \(\mbox{Odds ratio}=1 \rightarrow\) The odds of the event occurring remains the same.
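These rules follow from comparing the log-odds at \(X_i = x+1\) and at \(X_i = x\), holding all other variables constant:

```latex
\ln(\mbox{odds}_{x+1}) - \ln(\mbox{odds}_{x}) = \beta_i
\quad \Longrightarrow \quad
\frac{\mbox{odds}_{x+1}}{\mbox{odds}_{x}} = e^{\beta_i} = \mbox{OR}_i
```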

Logistic model (quantitative variable)

  • Going back to the regicor database, we want to study if bmi has an effect on death risk.

  • We want to fit a logistic model

\[\ln\left(\frac{P(\mbox{death})}{1- P(\mbox{death})}\right)= \beta_0 + \beta_1 \cdot \mbox{bmi}\]

Logistic model (quantitative variable)

  • We can estimate the logistic model with the glm() function, specifying the outcome, the predictor variables, the data and family=binomial.
#filter for patients without the outcome missing
library(dplyr)
regicor <- regicor |> 
  filter(!is.na(death))
m1 <- glm(death ~ bmi, data=regicor, family=binomial)

Logistic model (quantitative variable)

summary(m1)

Call:
glm(formula = death ~ bmi, family = binomial, data = regicor)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.83607    0.48094 -12.135  < 2e-16 ***
bmi          0.11882    0.01592   7.466  8.3e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1199.4  on 2124  degrees of freedom
Residual deviance: 1145.2  on 2123  degrees of freedom
  (23 observations deleted due to missingness)
AIC: 1149.2

Number of Fisher Scoring iterations: 5

Logistic model (quantitative variable)

We can visualise the model results in a fancy table:

  • Using tbl_regression() function from {gtsummary} package

  • Specifying exp=TRUE to show Odds Ratio instead of log-odds.

m1 |> 
  tbl_regression(exp = TRUE)
Characteristic OR 95% CI p-value
bmi 1.13 1.09, 1.16 <0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
  • p-value < 0.001 \(\longrightarrow\) There is a statistically significant association between bmi and death.

Logistic model (quantitative variable)

m1 |> 
  tbl_regression(exp = TRUE)
Characteristic OR 95% CI p-value
bmi 1.13 1.09, 1.16 <0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
  • OR > 1 \(\longrightarrow\) If bmi increases, the odds of dying increase.

  • OR = 1.13 \(\longrightarrow\) For a 1-unit increase in the bmi, the odds of dying are multiplied by 1.13 \(\longrightarrow\) For a 1-unit increase in the bmi, the odds of dying are increased by 13%.

  • OR for a 5-units increase: \[e^{5 \cdot \beta_{bmi}} = (e^{\beta_{bmi}})^5=(\mbox{OR}_{bmi})^5=1.13^5=1.84\]

Predictions

  • We can predict the probability of death for an individual given its bmi with the predict() function using type=response.

  • If no newdata argument is specified, the probabilities are computed for the data used to fit the model.

regicor2 <- regicor |>
  filter(!is.na(bmi))
regicor2 <- regicor2 |>
  mutate(pred = predict(m1, type="response", newdata=regicor2)) |>
  select(id,bmi,death,pred)
head(regicor2)
          id      bmi death       pred
1       2265 25.00000   Yes 0.05388009
2       1882 25.21736    No 0.05521190
3 3000105616 24.22145    No 0.04935442
4 3000103485 31.46837    No 0.10938419
5 3000100883 17.42509    No 0.02262871
6 3000108476 29.44676   Yes 0.08808444
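type="response" simply applies the inverse logit to the linear predictor, which can be checked with plogis(). A self-contained sketch with simulated data (the variable names and coefficients here are made up for illustration):

```r
set.seed(1)
# Simulated binary outcome depending on one continuous predictor
x <- rnorm(200, mean = 25, sd = 4)
y <- rbinom(200, size = 1, prob = plogis(-6 + 0.2 * x))
m <- glm(y ~ x, family = binomial)

# predict(type = "response") = inverse logit of beta0 + beta1 * x
p_manual <- plogis(coef(m)[1] + coef(m)[2] * x)
stopifnot(isTRUE(all.equal(unname(p_manual),
                           unname(predict(m, type = "response")))))
```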

Logistic model (qualitative variable)

  • We want to study the association of smoking and death.

  • Smoking has three categories: Never smoker, Current or former < 1y and Former >= 1y. Let’s see how the categories are ordered:

levels(regicor$smoker)
[1] "Never smoker"           "Current or former < 1y" "Former >= 1y"          
  • Let’s reorder the levels
regicor$smoker <- factor(regicor$smoker, 
                         levels=c("Never smoker","Former >= 1y",
                                  "Current or former < 1y"),
                         labels=c("Never smoker","Former >= 1y",
                                  "Current or former < 1y"))
levels(regicor$smoker)
[1] "Never smoker"           "Former >= 1y"           "Current or former < 1y"

Logistic model (qualitative variable)

  • With Never smoker being the reference category, we are going to build the model

\[\ln\left(\frac{P(\mbox{death})}{1- P(\mbox{death})}\right)= \beta_0 + \beta_1 \cdot \mbox{\{Former >= 1y\}} + \beta_2 \cdot \mbox{\{Current or former < 1y\}}\]

Logistic model (qualitative variable)

m2 <- glm(death ~ smoker, data=regicor, family=binomial)
summary(m2)

Call:
glm(formula = death ~ smoker, family = binomial, data = regicor)

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)    
(Intercept)                   -2.5041     0.1122 -22.327  < 2e-16 ***
smokerFormer >= 1y            -0.5895     0.2658  -2.218  0.02656 *  
smokerCurrent or former < 1y   0.5479     0.1706   3.211  0.00132 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1196.8  on 2109  degrees of freedom
Residual deviance: 1175.0  on 2107  degrees of freedom
  (38 observations deleted due to missingness)
AIC: 1181

Number of Fisher Scoring iterations: 5

Logistic model (qualitative variable)

m2 |> 
  tbl_regression(exp = TRUE)
Characteristic OR 95% CI p-value
smoker


    Never smoker
    Former >= 1y 0.55 0.32, 0.91 0.027
    Current or former < 1y 1.73 1.24, 2.41 0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
  • For Former >= 1y, OR < 1 \(\longrightarrow\) Former smokers have lower odds of dying than never smokers.

  • For Current or former < 1y, OR > 1 \(\longrightarrow\) Current or recent former smokers have higher odds of dying than never smokers.

Logistic model (qualitative and quantitative variable)

  • If we now add bmi to the model:
m3 <- glm(death ~ smoker + bmi, data=regicor, family=binomial)
m3 |> 
  tbl_regression(exp = TRUE)
Characteristic OR 95% CI p-value
smoker


    Never smoker
    Former >= 1y 0.57 0.33, 0.94 0.035
    Current or former < 1y 2.09 1.47, 2.95 <0.001
bmi 1.14 1.10, 1.17 <0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Predictions

  • We can compute the predictions of this model
regicor3 <- regicor |>
  filter(!is.na(smoker) & !is.na(bmi))
regicor3 <- regicor3 |>
  mutate(pred = predict(m3, type="response", newdata=regicor3)) |>
  select(id,bmi,death,pred, smoker)
head(regicor3)
          id      bmi death       pred                 smoker
1       2265 25.00000   Yes 0.04586578           Never smoker
2       1882 25.21736    No 0.04710176           Never smoker
3 3000105616 24.22145    No 0.08324754 Current or former < 1y
4 3000103485 31.46837    No 0.09928560           Never smoker
5 3000100883 17.42509    No 0.03657965 Current or former < 1y
6 3000108476 29.44676   Yes 0.07838014           Never smoker

Predictions

  • And we can plot them with ggplot() function from {ggplot2} package.
ggplot(regicor3, aes(x = bmi, y = pred, color = smoker)) +
  geom_smooth()

Predictions

  • And we can customize the plot
ggplot(regicor3, aes(x = bmi, y = pred, color = smoker)) +
  geom_smooth(se = FALSE) + 
  labs(x = "BMI", y = "Predicted probability of death",
        color = "Smoking status") +  # Change legend title and axis labels
  theme_bw() +
  theme(legend.position = "top")  # Move legend to the top

Predictive accuracy

  • The predictive accuracy of a logistic regression model can be assessed with the ROC curve and its area under the curve (AUC).

Take home messages

Take home messages

  • Statistical models are a useful way to analyze the relationship between an outcome variable and several predictor variables.

  • Adjust for confounders when evaluating intervention effects.

  • Use DAGs when assessing causal relationships.

  • Exploratory analyses allow more flexibility, but model assumptions must always be checked.

  • Continuous variable \(\longrightarrow\) Linear regression

  • Dichotomous variable \(\longrightarrow\) Logistic regression

  • Internally and externally validate predictive models.

  • Always check the residuals and the goodness of fit of models


📅 Thanks & See You on Friday! 👋